Project: FordGoBike_System Data Exploration

BY CHRISTIAN OLUOMA

Table of Contents

Introduction

Dataset Description

This dataset includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area.

Question for Analysis

Data Wrangling

Data Gathering

Data Assessing

The dataset has 183,412 riding entries and 16 features.

Some features, such as start_station_id, start_station_name, end_station_id, end_station_name, member_birth_year and member_gender, contain NaN (Not a Number) values.

Some features have an incorrect datatype and should be converted.

There are no duplicate entries in the dataset.

The number of missing values can be seen for the following features: start_station_id, start_station_name, end_station_id, end_station_name, member_birth_year and member_gender.

The number of unique values for each feature can be seen.
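The assessment checks described above can be sketched with a few pandas calls. This is a minimal illustration on a toy DataFrame with made-up values, not the notebook's original code; only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration.
df = pd.DataFrame({
    "start_station_id": [21.0, np.nan, 86.0, 21.0],
    "member_birth_year": [1984.0, 1972.0, np.nan, 1990.0],
    "member_gender": ["Male", "Female", None, "Other"],
    "user_type": ["Customer", "Subscriber", "Subscriber", "Subscriber"],
})

n_rows, n_features = df.shape            # number of entries and features
missing_per_column = df.isnull().sum()   # count of missing values per feature
n_duplicates = df.duplicated().sum()     # number of fully duplicated rows
unique_per_column = df.nunique()         # distinct non-null values per feature

print(n_rows, n_features)
print(missing_per_column)
print(n_duplicates)
print(unique_per_column)
```

On the real data, df.info() and df.describe() would give the datatypes and the summary statistics referred to in the observations.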

From the summary statistics of the dataset, there appear to be outliers in the member_birth_year feature, since a user born in the year 1878 would not be able to ride a bike. These outliers need to be identified and removed.

Values above the upper quartile of the duration tend to be very large. We need to convert the duration from seconds to minutes.

The start_station_id, end_station_id and bike_id should be converted to a string datatype, or removed altogether, since they add no value to our analysis.

Male riders appear to outnumber female riders and riders of other genders.

Observations

Quality issues

Tidiness issues

Data Cleaning

Quality issues

No 1: The datatype of 'start_time' and 'end_time' should be datetime.

Using pandas to_datetime function.
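A minimal sketch of this conversion, assuming the timestamps are stored as strings; the sample values are made up for illustration.

```python
import pandas as pd

# Toy rows with string timestamps; the values are illustrative only.
df = pd.DataFrame({
    "start_time": ["2019-02-28 17:32:10.1450", "2019-02-28 18:53:21.7890"],
    "end_time": ["2019-03-01 08:01:55.9750", "2019-03-01 06:42:03.0560"],
})

# Convert both columns from strings to datetime64 values.
for column in ["start_time", "end_time"]:
    df[column] = pd.to_datetime(df[column])

print(df.dtypes)
```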

No 2: The datatype of 'bike_id', 'start_station_id' and 'end_station_id' should be object.

Using astype function.
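One possible way to do this is sketched below on a toy DataFrame with made-up identifiers. Casting to "object" (rather than to str) is a choice made for this sketch: it keeps missing station ids as NaN, so they can still be detected and dropped in a later step.

```python
import numpy as np
import pandas as pd

# Toy identifiers; the values are illustrative only.
df = pd.DataFrame({
    "bike_id": [4902, 2535, 5905],
    "start_station_id": [21.0, np.nan, 86.0],
    "end_station_id": [13.0, 81.0, np.nan],
})

id_columns = ["bike_id", "start_station_id", "end_station_id"]

# astype accepts a {column: dtype} mapping, so only the id columns change.
df = df.astype({column: "object" for column in id_columns})

print(df.dtypes)
```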

No 3: Missing rows in 'start_station_id', 'start_station_name', 'end_station_id', 'end_station_name' should be dropped.

Using a defined function.

No 4: Missing rows in 'member_birth_year' and 'member_gender' should be dropped.

Using a predefined function.
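The two dropping steps above can share one helper. The function below is a hypothetical sketch (the name drop_missing_rows is not from the original notebook) built on DataFrame.dropna; the sample rows are made up.

```python
import numpy as np
import pandas as pd

def drop_missing_rows(frame, columns):
    """Return a copy of `frame` without rows that have a missing value in any of `columns`."""
    return frame.dropna(subset=columns).reset_index(drop=True)

# Toy rows; the values are illustrative only.
df = pd.DataFrame({
    "start_station_id": [21.0, np.nan, 86.0, 3.0],
    "start_station_name": ["Station A", None, "Station C", "Station E"],
    "end_station_id": [13.0, 81.0, 5.0, 7.0],
    "end_station_name": ["Station B", "Station D", "Station F", "Station G"],
    "member_birth_year": [1984.0, 1972.0, np.nan, 1990.0],
    "member_gender": ["Male", "Female", None, "Other"],
})

station_columns = ["start_station_id", "start_station_name",
                   "end_station_id", "end_station_name"]
member_columns = ["member_birth_year", "member_gender"]

df_clean = drop_missing_rows(df, station_columns)        # No 3
df_clean = drop_missing_rows(df_clean, member_columns)   # No 4

print(len(df), len(df_clean))
```

Because the helper returns a new DataFrame, the original df is left untouched and the two steps can be checked separately.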

No 5: The datatype of 'member_birth_year' should be int64.

Using the astype function.
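A minimal sketch of the cast, on made-up values. It assumes the rows with a missing birth year have already been dropped, since NaN cannot be cast to int64 and astype would raise an error.

```python
import pandas as pd

# Birth years are read as floats when the column originally contained NaN.
df = pd.DataFrame({"member_birth_year": [1984.0, 1972.0, 1990.0]})

df["member_birth_year"] = df["member_birth_year"].astype("int64")

print(df["member_birth_year"].dtype)
```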

No 6: The duration should be in minutes and outliers should be removed.

Divide by 60 to convert seconds to minutes, and use quantile(), between() and drop() to remove outliers.
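The step above can be sketched as follows. Two things here are assumptions rather than the notebook's confirmed choices: the raw column name duration_sec, and the use of 1.5 x IQR fences as the outlier rule. The sample values are made up.

```python
import pandas as pd

# 'duration_sec' is an assumed name for the raw duration column.
df = pd.DataFrame({"duration_sec": [300, 360, 420, 480, 540, 600, 660, 86400]})

# Convert seconds to minutes.
df["duration_min"] = df["duration_sec"] / 60

# Assumed outlier rule: keep values within 1.5 * IQR of the quartiles.
q1 = df["duration_min"].quantile(0.25)
q3 = df["duration_min"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# between() flags the typical rows; drop() removes the rest by index.
is_typical = df["duration_min"].between(lower, upper)
df = df.drop(df[~is_typical].index).reset_index(drop=True)

print(df["duration_min"].describe())
```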

No 7: Remove outliers from member's birth year and represent each year with the member's age.
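A possible sketch of this step on made-up birth years. The reference year of 2019 and the 1.5 x IQR rule are assumptions made for illustration; the original notebook may have used a different cutoff.

```python
import pandas as pd

REFERENCE_YEAR = 2019  # assumption: ages are computed relative to 2019

# Toy birth years, including one implausible value (1878).
df = pd.DataFrame(
    {"member_birth_year": [1878, 1984, 1990, 1988, 1975, 1995, 1992, 1980]}
)

# Represent each birth year by the member's age.
df["member_age"] = REFERENCE_YEAR - df["member_birth_year"]

# Assumed outlier rule: keep ages within 1.5 * IQR of the quartiles.
q1 = df["member_age"].quantile(0.25)
q3 = df["member_age"].quantile(0.75)
iqr = q3 - q1
plausible = df["member_age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[plausible].reset_index(drop=True)

print(df["member_age"].tolist())
```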

Tidiness Issues

No 1: Day, Month and Year should be extracted from 'start_time' and 'end_time' and placed into individual columns.
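A minimal sketch of the extraction using the .dt accessor, on made-up timestamps. Apart from start_hour, which is used later in the exploration, the new column names (start_day, end_month and so on) are assumptions.

```python
import pandas as pd

# Toy timestamps; the values are illustrative only.
df = pd.DataFrame({
    "start_time": pd.to_datetime(["2019-02-28 17:32:10", "2019-02-01 08:05:00"]),
    "end_time": pd.to_datetime(["2019-03-01 08:01:55", "2019-02-01 08:15:30"]),
})

# Split each timestamp into separate day, month, year and hour columns.
for prefix in ["start", "end"]:
    timestamps = df[f"{prefix}_time"]
    df[f"{prefix}_day"] = timestamps.dt.day
    df[f"{prefix}_month"] = timestamps.dt.month
    df[f"{prefix}_year"] = timestamps.dt.year
    df[f"{prefix}_hour"] = timestamps.dt.hour

print(df.filter(like="start_"))
```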

Drop Unwanted Columns

The dataset is ready for use.

Data Storage

Exploratory Data Analysis

Univariate Exploration

Looking at the distribution of the main variable of interest: duration_min.

The duration in minutes takes on a range of values, from about 3 mins to about 20 mins. Most trips lasted between about 5 mins and about 8 mins. The number of trips decreases gradually as the duration increases from about 8 mins to about 20 mins.

There was a sharp increase in the number of trips between about 3 mins and about 5 mins.

Let's consider another variable of interest: member_age

The age distribution in the data is skewed to the right, since only a small number of users are older (between about 50 yrs and about 80 yrs). Users between about 25 yrs and about 35 yrs appear to be the most numerous.

Let's consider another variable of interest: user_type

Subscribers appear to outnumber customers.

Let's consider another variable of interest: member_gender

Male riders appear to outnumber female riders and riders of other genders.

Let's consider another variable of interest: start_hour

Most users seem to ride in the 8th and 17th hours of the day, with few rides recorded in the early hours of the day.

Bivariate Exploration

Let's consider the relationship between member_age and duration_min.

Users span a wide range of ages, from about 18 yrs to about 80 yrs. Users between 18 yrs and 40 yrs spend more time riding, with a high concentration of users between 25 yrs and 35 yrs. Most users between 25 yrs and 35 yrs spend about 10 mins riding, with the highest concentration at about 6 mins.

Let's consider the relationship between member_gender and duration_min.

Females spend an average of about 9 mins riding, while males spend an average of about 8 mins riding, a little lower than females. The other gender spends an average of about 9 mins riding. The upper quartile of riding duration is about 13 mins for females, about 11 mins for males and about 12 mins for the other gender. All genders have their minimum and maximum riding durations at about 3 mins and about 19 mins respectively.

Let's consider the relationship between user_type and duration_min

Subscribers spend an average of about 9 mins riding, while customers spend an average of about 11 mins riding. Customers tend to have a higher riding duration than subscribers. The upper quartile of customers' riding duration is a little below 15 mins, while the upper quartile of subscribers' riding duration is a little below 12.5 mins. Subscribers tend to be concentrated at riding durations between 5 mins and 7.5 mins. Customers are spread across the range between the minimum and the maximum duration, unlike subscribers, few of whom have a riding duration above 15 mins.

Let's consider the relationship between member_gender and user_type

Within each gender, more users tend to be subscribers than customers.

Let's consider the relationship between Start_hour and Average duration

At the start of the day, the riding duration fluctuates irregularly, with its peaks (9.6 mins and 9.8 mins) at the 3rd and 8th hours of the day respectively. The hour with the lowest average duration is the 4th hour of the day. From the 17th hour, which has a peak of 9.5 mins, there is a steady decline in riding duration until the end of the day. People tend to spend more time riding between the 5th and 9th hours of the day.
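The hourly averages discussed above come from grouping the rides by start_hour. The sketch below shows the idea on a toy DataFrame with made-up durations, so its numbers do not reproduce the figures quoted in the text.

```python
import pandas as pd

# Toy rides; the values are illustrative only.
df = pd.DataFrame({
    "start_hour": [8, 8, 17, 17, 4],
    "duration_min": [9.0, 11.0, 9.0, 10.0, 6.0],
})

# Average riding duration for each hour of the day.
avg_by_hour = df.groupby("start_hour")["duration_min"].mean()

peak_hour = avg_by_hour.idxmax()     # hour with the highest average duration
lowest_hour = avg_by_hour.idxmin()   # hour with the lowest average duration

print(avg_by_hour)
print(peak_hour, lowest_hour)
```

The same pattern, grouping by a day-of-month column instead of start_hour, would give the daily averages.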

Let's consider the relationship between Day of Month and Average duration

In the month of February 2019, Day 13 recorded the lowest average riding duration while Day 23 recorded the highest average riding duration. The average duration recorded on the other days of the month tends to range between 8.8 mins and 9.5 mins.

Multivariate Exploration

Let's consider the relationship between Gender, Day and Average duration

The male gender tends to have the lowest average duration per day of the month, ranging from 8.5 mins to 9.5 mins. The female gender has an average duration per day ranging from 9.0 mins to a little below 10.5 mins. The other gender has a wide range of average duration per day, from 8.5 mins to a little below 11 mins.

Let's consider the relationship between User Type, Gender and Average duration

Across all genders, customers spend more time riding, with an average duration above 10 mins. Female customers tend to have the highest duration, at about 11 mins.
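A comparison like this can be computed by grouping on two categorical features at once. The sketch below uses a toy DataFrame with made-up durations and is not the notebook's original plotting code.

```python
import pandas as pd

# Toy rides; the values are illustrative only.
df = pd.DataFrame({
    "user_type": ["Customer", "Customer", "Subscriber",
                  "Subscriber", "Customer", "Subscriber"],
    "member_gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "duration_min": [12.0, 10.0, 9.0, 8.0, 10.0, 7.0],
})

# Mean duration per (user_type, member_gender) pair, with genders as columns.
avg_duration = (
    df.groupby(["user_type", "member_gender"])["duration_min"]
      .mean()
      .unstack("member_gender")
)

print(avg_duration)
```

The resulting table has one row per user type and one column per gender, which is a convenient shape for a grouped bar chart or a heatmap.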

Let's consider the relationship between all numerical features.

The latitude and longitude correlate negatively, with a correlation coefficient of magnitude 0.69. Other numerical features have little correlation with each other, which makes them suitable for building a model that helps predict duration.
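Pairwise correlations of this kind can be obtained with DataFrame.corr, which computes Pearson coefficients by default. The sketch below uses made-up coordinates; the column names start_station_latitude and start_station_longitude are assumptions about how the location features are named.

```python
import pandas as pd

# Toy numerical features; names and values are assumptions for illustration.
df = pd.DataFrame({
    "start_station_latitude": [37.70, 37.75, 37.80, 37.85],
    "start_station_longitude": [-122.20, -122.30, -122.35, -122.45],
    "duration_min": [9.0, 7.5, 11.0, 8.0],
})

# Pearson correlation between every pair of numerical columns.
corr = df.corr()

lat_lon_corr = corr.loc["start_station_latitude", "start_station_longitude"]

print(corr.round(2))
```

The matrix is symmetric with ones on the diagonal, so a heatmap of it only needs to be read on one side of the diagonal.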

Let's consider the Effect of Location on the Duration.

The stations seem to be distributed along the coastal areas of San Francisco (San Francisco Bay), with fewer stations having a high duration between 16 mins and 18 mins and most stations having a duration between 8 mins and 12 mins.

Conclusion

When are most trips taken in terms of time of day, day of the month?

How long does the average trip per hour take?

Does the length of the average trip depend on whether a user is a subscriber or a customer?